Skip to content

fix: surface test-stdout.txt in verifier exception for dep_install classification (#540)#572

Open
ElegantLin wants to merge 2 commits into
mainfrom
fix/540-verifier-dep-install
Open

fix: surface test-stdout.txt in verifier exception for dep_install classification (#540)#572
ElegantLin wants to merge 2 commits into
mainfrom
fix/540-verifier-dep-install

Conversation

@ElegantLin
Copy link
Copy Markdown
Contributor

@ElegantLin ElegantLin commented May 27, 2026

Summary

  • Root cause: RewardFileNotFoundError in verifier.py only included file paths in its message, never pip's actual error output. The classify_verifier_error function had dep_install markers ("no solution found", etc.) but could never match them because they lived in test-stdout.txt, not the exception.
  • Fix: When raising RewardFileNotFoundError, tail the last 30 lines of test-stdout.txt into the exception message. This lets classify_verifier_error see pip's output and correctly return verifier_dep_install.
  • No changes to scoring.py — the classifier logic and markers were already correct (added in prior PRs); only the wiring was broken.

Test plan

  • 4 new parametrized cases in test_classify_verifier_error covering all dep_install markers in exception messages
  • test_dep_install_takes_precedence_over_generic_crash — precedence test
  • TestRewardFileNotFoundSurfacesStdout — 4 end-to-end wiring tests: stdout tail present, classifier fires, missing file graceful, non-dep-install stays verifier_failure
  • All 96 verifier+scoring tests pass
  • ruff clean

Closes #540

🤖 Generated with Claude Code


Open in Devin Review

…l classifier fires (#540)

The verifier_error_category=dep_install classification could never fire
in production because RewardFileNotFoundError only contained file paths,
not pip's actual error output. Now the exception tails test-stdout.txt
(last 30 lines) into the message, letting classify_verifier_error see
markers like "no solution found" from real pip failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 potential issue.

View 3 additional findings in Devin Review.

Open in Devin Review

Comment thread src/benchflow/task/verifier.py Outdated
f"{self._rollout_paths.reward_json_path}"
)
if stdout_tail:
msg += f"\n--- test-stdout.txt (last 30 lines) ---\n{stdout_tail}"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Hardcoded "30" in error message drifts from _TAIL_LINES constant

The error message at src/benchflow/task/verifier.py:317 uses a hardcoded string "last 30 lines" instead of referencing the _TAIL_LINES constant defined at src/benchflow/task/verifier.py:63. If _TAIL_LINES is ever changed (e.g. to 50), _tail_file will return a different number of lines but the error message will still claim "last 30 lines", misleading anyone triaging verifier failures.

Suggested change
msg += f"\n--- test-stdout.txt (last 30 lines) ---\n{stdout_tail}"
msg += f"\n--- test-stdout.txt (last {_TAIL_LINES} lines) ---\n{stdout_tail}"
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@bingran-you bingran-you left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thermo-nuclear review verdict: request changes.

The dependency-install classification improvement is useful, and the real Claude ACP smoke passed, but the current implementation surfaces raw verifier stdout through RewardFileNotFoundError / verifier_error. That widens the secret-leak blast radius because test-stdout.txt can contain env dumps, install URLs, tokens, or other untrusted subprocess output that later lands in summaries/dashboards.

Please redact before appending stdout tails, or keep raw stdout out of result metadata and classify from a private diagnostic path instead.

Additional maintainability issues:

  • _tail_file() reads/splits the whole file to return 30 lines. Please stream-tail it, e.g. with collections.deque(f, maxlen=n).
  • tests/test_verify.py grows past 1k lines. Please move these regression tests to a focused verifier-output module.
  • Regression test docstrings should name the reviewed PR/fix explicitly, e.g. PR #572 / #540, per repo convention.

Verification notes:

  • Branch-head checks passed: uv sync, full pytest, ty, ruff.
  • Current-merge pytest and ty passed.
  • Current-merge ruff failed only on existing main baseline UP041 in src/benchflow/sandbox/docker.py:408, not this PR.
  • Real claude-agent-acp Docker smoke passed: reward 1.0, trajectory_source=acp, 5 ACP events, 1 tool call, usage provider_response, total tokens 64349, llm_trajectory.jsonl present.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ENG-151 verifier_error_category=dep_install classification cannot fire in production — classifier markers never match the production emitter

2 participants